Predicting Dog Breeds Using Transfer Learning on SageMaker¶

In this notebook, we walk through building an image classification model that distinguishes among 133 dog breeds, using the dog breed dataset from Udacity (https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip).

We use a pre-trained ResNet-50 model from the PyTorch Vision library and add two fully connected layers on top of it. Applying transfer learning, we freeze the existing convolutional layers of the ResNet-50 model and compute gradients only for the two fully connected layers.

We then perform hyperparameter tuning to optimize the model. After fine-tuning using the best hyperparameters, we add profiling and debugging configurations to the training and evaluation phases. The final step involves deploying the model by creating a custom inference script for predictions.

Finally, we test the model using test images of dogs to ensure it meets our expectations.
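Because the backbone is frozen, only the small classifier head is trained. As a sanity check, its trainable parameter count can be computed by hand from the layer shapes reported in the training logs later in this notebook (2048 ResNet-50 features → 256 hidden units → 133 breeds); the helper below is purely illustrative, not part of the training code:

```python
# Trainable parameters of the two-layer classifier head only;
# the frozen ResNet-50 backbone contributes none.
def linear_params(in_features: int, out_features: int) -> int:
    # weight matrix plus bias vector
    return in_features * out_features + out_features

head_params = linear_params(2048, 256) + linear_params(256, 133)
print(head_params)  # 558725, matching "Total Trainable Params" in the logs
```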

In [2]:
# TODO: Install any packages that you might need
# For instance, you will need the smdebug package
!pip install smdebug
Keyring is skipped due to an exception: 'keyring.backends'
Collecting smdebug
  Downloading smdebug-1.0.12-py2.py3-none-any.whl (270 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 270.1/270.1 kB 2.6 MB/s eta 0:00:00
Requirement already satisfied: protobuf>=3.6.0 in /opt/conda/lib/python3.7/site-packages (from smdebug) (3.20.3)
Requirement already satisfied: boto3>=1.10.32 in /opt/conda/lib/python3.7/site-packages (from smdebug) (1.26.24)
Requirement already satisfied: packaging in /opt/conda/lib/python3.7/site-packages (from smdebug) (20.1)
Requirement already satisfied: numpy>=1.16.0 in /opt/conda/lib/python3.7/site-packages (from smdebug) (1.21.6)
Collecting pyinstrument==3.4.2
  Downloading pyinstrument-3.4.2-py2.py3-none-any.whl (83 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 83.3/83.3 kB 1.4 MB/s eta 0:00:00
Collecting pyinstrument-cext>=0.2.2
  Downloading pyinstrument_cext-0.2.4-cp37-cp37m-manylinux2010_x86_64.whl (20 kB)
Requirement already satisfied: botocore<1.30.0,>=1.29.24 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (1.29.24)
Requirement already satisfied: jmespath<2.0.0,>=0.7.1 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (1.0.1)
Requirement already satisfied: s3transfer<0.7.0,>=0.6.0 in /opt/conda/lib/python3.7/site-packages (from boto3>=1.10.32->smdebug) (0.6.0)
Requirement already satisfied: pyparsing>=2.0.2 in /opt/conda/lib/python3.7/site-packages (from packaging->smdebug) (2.4.6)
Requirement already satisfied: six in /opt/conda/lib/python3.7/site-packages (from packaging->smdebug) (1.14.0)
Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /opt/conda/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.24->boto3>=1.10.32->smdebug) (2.8.2)
Requirement already satisfied: urllib3<1.27,>=1.25.4 in /opt/conda/lib/python3.7/site-packages (from botocore<1.30.0,>=1.29.24->boto3>=1.10.32->smdebug) (1.26.13)
Installing collected packages: pyinstrument-cext, pyinstrument, smdebug
Successfully installed pyinstrument-3.4.2 pyinstrument-cext-0.2.4 smdebug-1.0.12
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip available: 22.3.1 -> 23.0
[notice] To update, run: pip install --upgrade pip
In [8]:
# TODO: Import any packages that you might need
# For instance you will need Boto3 and Sagemaker
import sagemaker
import boto3
from sagemaker.session import Session
from sagemaker import get_execution_role
# Initializing some useful variables
role = get_execution_role()
sagemaker_session = sagemaker.Session()
region = sagemaker_session.boto_region_name
bucket = sagemaker_session.default_bucket()
print(f"Region {region}")
print(f"Default s3 bucket : {bucket}")
Region us-west-2
Default s3 bucket : sagemaker-us-west-2-232496288858

Dataset¶

For this project, we used the dogImages dataset available at this link. It comprises images of 133 dog breeds, divided into train, validation, and test folders, each containing examples of every breed. For instance, the path to a sample image in the test folder is ./dogImages/test/018.Beauceron/Beauceron_01284.jpg.
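Each class directory name encodes both a breed index and the breed name (e.g. 018.Beauceron), which is what ImageFolder-style loaders rely on to assign labels. A minimal sketch of extracting that information from a path (the helper name `parse_breed_dir` is ours, purely illustrative):

```python
import os

def parse_breed_dir(image_path: str):
    """Split a dogImages path into (breed_index, breed_name)."""
    # e.g. ./dogImages/test/018.Beauceron/Beauceron_01284.jpg
    class_dir = os.path.basename(os.path.dirname(image_path))
    index_str, breed = class_dir.split(".", 1)
    return int(index_str), breed

print(parse_breed_dir("./dogImages/test/018.Beauceron/Beauceron_01284.jpg"))
# (18, 'Beauceron')
```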

In [6]:
#TODO: Fetch and upload the data to AWS S3

# Command to download and unzip data
!wget https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
!unzip dogImages.zip  > /dev/null
--2023-02-02 18:31:55--  https://s3-us-west-1.amazonaws.com/udacity-aind/dog-project/dogImages.zip
Resolving s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)... 52.219.194.0
Connecting to s3-us-west-1.amazonaws.com (s3-us-west-1.amazonaws.com)|52.219.194.0|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1132023110 (1.1G) [application/zip]
Saving to: ‘dogImages.zip’

dogImages.zip       100%[===================>]   1.05G  71.3MB/s    in 17s     

2023-02-02 18:32:14 (64.6 MB/s) - ‘dogImages.zip’ saved [1132023110/1132023110]

In [9]:
prefix = "dogImagesDataset"
print("Starting to upload dogImages")

inputs = sagemaker_session.upload_data(path="dogImages", bucket=bucket, key_prefix=prefix)
print(f"Input path ( S3 file path ): {inputs}")
Starting to upload dogImages
Input path ( S3 file path ): s3://sagemaker-us-west-2-232496288858/dogImagesDataset

Hyperparameter Tuning¶

For this image classification problem, we used a ResNet-50 model with two fully connected linear layers on top. ResNet-50 is a 50-layer deep convolutional network trained on more than a million ImageNet images spanning 1000 categories, which makes it well suited to image recognition tasks. The optimizer we used is AdamW (for more information, see https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html). The hyperparameter tuning job searches the following ranges: learning rate (default 0.001) over a continuous range of 0.0001 to 0.1; eps (default 1e-08) over 1e-09 to 1e-08; weight decay (default 0.01) over 1e-03 to 1e-01; and batch size over the two categorical values 64 and 128.

Note: You will need to use the hpo.py script to perform hyperparameter tuning.

In [28]:
# Import the required classes from the sagemaker.tuner module
from sagemaker.tuner import (
    CategoricalParameter,
    ContinuousParameter,
    HyperparameterTuner
)

# We use AdamW as the optimizer, which decouples the weight decay computation from the
# gradient update (a more correct formulation than classic Adam), so we tune weight_decay
# and eps in addition to the learning rate and batch size.
hyperparameter_ranges = {
    "lr": ContinuousParameter(0.0001, 0.1),
    "eps": ContinuousParameter(1e-9, 1e-8),
    "weight_decay": ContinuousParameter(1e-3, 1e-1),
    "batch_size": CategoricalParameter([ 64, 128]),
}
objective_metric_name = "average test loss"
objective_type = "Minimize"
metric_definitions = [{"Name": "average test loss", "Regex": "Test set: Average loss: ([0-9\\.]+)"}]
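The tuner only sees the objective through this regex, so it is worth confirming that the pattern actually matches the log line format the training script emits (the sample line below is copied from the training output later in this notebook):

```python
import re

# Same regex as in metric_definitions above
regex = r"Test set: Average loss: ([0-9\.]+)"
sample = "Test set: Average loss: 4.1145, Accuracy: 216/836 (26%)"

match = re.search(regex, sample)
print(float(match.group(1)))  # 4.1145
```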
In [29]:
from  sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point = "hpo.py",
    base_job_name = "dog-breed-classification-hpo",
    role = role,
    instance_count = 1,
    instance_type = "ml.p3.2xlarge",
    py_version = "py36",
    framework_version = "1.8"
)

tuner = HyperparameterTuner(
    estimator,
    objective_metric_name,
    hyperparameter_ranges,
    metric_definitions,
    max_jobs=4,
    max_parallel_jobs=1,
    objective_type=objective_type, 
    early_stopping_type="Auto"
)
In [30]:
# TODO: Fit your HP Tuner
tuner.fit({"training": inputs }, wait=True)
No finished training job found associated with this estimator. Please make sure this estimator is only used for building workflow config
......................................................................................................................................................................................................................................................................................................................................................!
In [32]:
# Get the best estimators and the best HPs

best_estimator = tuner.best_estimator()

#Get the hyperparameters of the best trained model
best_estimator.hyperparameters()
2023-02-02 21:25:00 Starting - Found matching resource for reuse
2023-02-02 21:25:00 Downloading - Downloading input data
2023-02-02 21:25:00 Training - Training image download completed. Training in progress.
2023-02-02 21:25:00 Uploading - Uploading generated training model
2023-02-02 21:25:00 Completed - Resource reused by training job: pytorch-training-230202-2107-003-62d9c381
Out[32]:
{'_tuning_objective_metric': '"average test loss"',
 'batch_size': '"64"',
 'eps': '8.22789935548792e-09',
 'lr': '0.00023934090400595828',
 'sagemaker_container_log_level': '20',
 'sagemaker_estimator_class_name': '"PyTorch"',
 'sagemaker_estimator_module': '"sagemaker.pytorch.estimator"',
 'sagemaker_job_name': '"dog-breed-classification-hpo-2023-02-02-21-07-24-934"',
 'sagemaker_program': '"hpo.py"',
 'sagemaker_region': '"us-west-2"',
 'sagemaker_submit_directory': '"s3://sagemaker-us-west-2-232496288858/dog-breed-classification-hpo-2023-02-02-21-07-24-934/source/sourcedir.tar.gz"',
 'weight_decay': '0.0013574056448429658'}
In [33]:
best_hyperparameters = {'batch_size': int(best_estimator.hyperparameters()['batch_size'].replace('"', "")),
                        'eps': best_estimator.hyperparameters()['eps'],
                        'lr': best_estimator.hyperparameters()['lr'],
                        'weight_decay': best_estimator.hyperparameters()['weight_decay']}
print(f"Best hyperparameters after tuning: \n {best_hyperparameters}")
Best hyperparameters after tuning: 
 {'batch_size': 64, 'eps': '8.22789935548792e-09', 'lr': '0.00023934090400595828', 'weight_decay': '0.0013574056448429658'}
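SageMaker returns tuned hyperparameters as strings, and categorical values come back JSON-quoted (note the '"64"' in the output above), which is why the cell strips quotes before reuse. A slightly more general cleanup can be sketched with json.loads; the helper `clean_hp` is our own, not a SageMaker API:

```python
import json

def clean_hp(value: str):
    """Convert a SageMaker hyperparameter string to a native Python value."""
    # JSON-quoted values (e.g. '"64"') are decoded first
    parsed = json.loads(value) if value.startswith(('"', '[', '{')) else value
    try:
        num = float(parsed)
        return int(num) if num.is_integer() else num
    except (TypeError, ValueError):
        return parsed

raw = {"batch_size": '"64"', "eps": "8.22789935548792e-09",
       "lr": "0.00023934090400595828", "weight_decay": "0.0013574056448429658"}
print({k: clean_hp(v) for k, v in raw.items()})
# {'batch_size': 64, 'eps': 8.22789935548792e-09,
#  'lr': 0.00023934090400595828, 'weight_decay': 0.0013574056448429658}
```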

Model Profiling and Debugging¶

TODO: Using the best hyperparameters, create and fine-tune a new model

Note: You will need to use the train_model.py script to perform model profiling and debugging.

In [34]:
# Setting up debugger and profiler rules and configs
from sagemaker.debugger import (
    Rule,
    rule_configs, 
    ProfilerRule,
    DebuggerHookConfig,
    CollectionConfig,
    ProfilerConfig,
    FrameworkProfile
)


rules = [
    Rule.sagemaker(rule_configs.vanishing_gradient()),
    Rule.sagemaker(rule_configs.overfit()),
    Rule.sagemaker(rule_configs.overtraining()),
    Rule.sagemaker(rule_configs.poor_weight_initialization()),
    ProfilerRule.sagemaker(rule_configs.ProfilerReport()),
]

profiler_config = ProfilerConfig(
    system_monitor_interval_millis=500, framework_profile_params=FrameworkProfile(num_steps=10)
)

collection_configs=[CollectionConfig(name="CrossEntropyLoss_output_0",parameters={
    "include_regex": "CrossEntropyLoss_output_0", "train.save_interval": "10","eval.save_interval": "1"})]

debugger_config=DebuggerHookConfig( collection_configs=collection_configs )
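With a train.save_interval of 10 and an eval.save_interval of 1, the debugger hook records the CrossEntropyLoss_output_0 tensor every 10th training step but every evaluation step. A small illustration of which step indices get captured under those intervals (this mirrors the interval semantics; it is not an actual smdebug call):

```python
def saved_steps(total_steps: int, save_interval: int):
    """Step indices captured by a hook with the given save interval."""
    return [step for step in range(total_steps) if step % save_interval == 0]

print(saved_steps(25, 10))  # train mode: [0, 10, 20]
print(saved_steps(5, 1))    # eval mode:  [0, 1, 2, 3, 4]
```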
In [35]:
# Create and fit an estimator
estimator = PyTorch(
    entry_point="train_model.py",
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    role=role,
    framework_version="1.6", # using 1.6 as it supports the smdebug library: https://github.com/awslabs/sagemaker-debugger#debugger-supported-frameworks
    py_version="py36",
    hyperparameters=best_hyperparameters,
    profiler_config=profiler_config, # include the profiler hook
    debugger_hook_config=debugger_config, # include the debugger hook
    rules=rules
)

estimator.fit({'train' : inputs },wait=True)
2023-02-02 21:42:09 Starting - Starting the training job...
2023-02-02 21:42:35 Starting - Preparing the instances for trainingVanishingGradient: InProgress
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress
ProfilerReport: InProgress
......
2023-02-02 21:43:37 Downloading - Downloading input data......
2023-02-02 21:44:34 Training - Downloading the training image......
2023-02-02 21:45:34 Training - Training image download completed. Training in progress...bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2023-02-02 21:45:54,173 sagemaker-training-toolkit INFO     Imported framework sagemaker_pytorch_container.training
2023-02-02 21:45:54,209 sagemaker_pytorch_container.training INFO     Block until all host DNS lookups succeed.
2023-02-02 21:45:54,212 sagemaker_pytorch_container.training INFO     Invoking user training script.
2023-02-02 21:45:54,512 sagemaker-training-toolkit INFO     Invoking user script
Training Env:
{
    "additional_framework_parameters": {},
    "channel_input_dirs": {
        "train": "/opt/ml/input/data/train"
    },
    "current_host": "algo-1",
    "framework_module": "sagemaker_pytorch_container.training:main",
    "hosts": [
        "algo-1"
    ],
    "hyperparameters": {
        "batch_size": 64,
        "eps": "8.22789935548792e-09",
        "lr": "0.00023934090400595828",
        "weight_decay": "0.0013574056448429658"
    },
    "input_config_dir": "/opt/ml/input/config",
    "input_data_config": {
        "train": {
            "TrainingInputMode": "File",
            "S3DistributionType": "FullyReplicated",
            "RecordWrapperType": "None"
        }
    },
    "input_dir": "/opt/ml/input",
    "is_master": true,
    "job_name": "pytorch-training-2023-02-02-21-42-08-475",
    "log_level": 20,
    "master_hostname": "algo-1",
    "model_dir": "/opt/ml/model",
    "module_dir": "s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/source/sourcedir.tar.gz",
    "module_name": "train_model",
    "network_interface_name": "eth0",
    "num_cpus": 8,
    "num_gpus": 1,
    "output_data_dir": "/opt/ml/output/data",
    "output_dir": "/opt/ml/output",
    "output_intermediate_dir": "/opt/ml/output/intermediate",
    "resource_config": {
        "current_host": "algo-1",
        "current_instance_type": "ml.p3.2xlarge",
        "current_group_name": "homogeneousCluster",
        "hosts": [
            "algo-1"
        ],
        "instance_groups": [
            {
                "instance_group_name": "homogeneousCluster",
                "instance_type": "ml.p3.2xlarge",
                "hosts": [
                    "algo-1"
                ]
            }
        ],
        "network_interface_name": "eth0"
    },
    "user_entry_point": "train_model.py"
}
Environment variables:
SM_HOSTS=["algo-1"]
SM_NETWORK_INTERFACE_NAME=eth0
SM_HPS={"batch_size":64,"eps":"8.22789935548792e-09","lr":"0.00023934090400595828","weight_decay":"0.0013574056448429658"}
SM_USER_ENTRY_POINT=train_model.py
SM_FRAMEWORK_PARAMS={}
SM_RESOURCE_CONFIG={"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.p3.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.p3.2xlarge"}],"network_interface_name":"eth0"}
SM_INPUT_DATA_CONFIG={"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}}
SM_OUTPUT_DATA_DIR=/opt/ml/output/data
SM_CHANNELS=["train"]
SM_CURRENT_HOST=algo-1
SM_MODULE_NAME=train_model
SM_LOG_LEVEL=20
SM_FRAMEWORK_MODULE=sagemaker_pytorch_container.training:main
SM_INPUT_DIR=/opt/ml/input
SM_INPUT_CONFIG_DIR=/opt/ml/input/config
SM_OUTPUT_DIR=/opt/ml/output
SM_NUM_CPUS=8
SM_NUM_GPUS=1
SM_MODEL_DIR=/opt/ml/model
SM_MODULE_DIR=s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/source/sourcedir.tar.gz
SM_TRAINING_ENV={"additional_framework_parameters":{},"channel_input_dirs":{"train":"/opt/ml/input/data/train"},"current_host":"algo-1","framework_module":"sagemaker_pytorch_container.training:main","hosts":["algo-1"],"hyperparameters":{"batch_size":64,"eps":"8.22789935548792e-09","lr":"0.00023934090400595828","weight_decay":"0.0013574056448429658"},"input_config_dir":"/opt/ml/input/config","input_data_config":{"train":{"RecordWrapperType":"None","S3DistributionType":"FullyReplicated","TrainingInputMode":"File"}},"input_dir":"/opt/ml/input","is_master":true,"job_name":"pytorch-training-2023-02-02-21-42-08-475","log_level":20,"master_hostname":"algo-1","model_dir":"/opt/ml/model","module_dir":"s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/source/sourcedir.tar.gz","module_name":"train_model","network_interface_name":"eth0","num_cpus":8,"num_gpus":1,"output_data_dir":"/opt/ml/output/data","output_dir":"/opt/ml/output","output_intermediate_dir":"/opt/ml/output/intermediate","resource_config":{"current_group_name":"homogeneousCluster","current_host":"algo-1","current_instance_type":"ml.p3.2xlarge","hosts":["algo-1"],"instance_groups":[{"hosts":["algo-1"],"instance_group_name":"homogeneousCluster","instance_type":"ml.p3.2xlarge"}],"network_interface_name":"eth0"},"user_entry_point":"train_model.py"}
SM_USER_ARGS=["--batch_size","64","--eps","8.22789935548792e-09","--lr","0.00023934090400595828","--weight_decay","0.0013574056448429658"]
SM_OUTPUT_INTERMEDIATE_DIR=/opt/ml/output/intermediate
SM_CHANNEL_TRAIN=/opt/ml/input/data/train
SM_HP_BATCH_SIZE=64
SM_HP_EPS=8.22789935548792e-09
SM_HP_LR=0.00023934090400595828
SM_HP_WEIGHT_DECAY=0.0013574056448429658
PYTHONPATH=/opt/ml/code:/opt/conda/bin:/opt/conda/lib/python36.zip:/opt/conda/lib/python3.6:/opt/conda/lib/python3.6/lib-dynload:/opt/conda/lib/python3.6/site-packages
Invoking script with the following command:
/opt/conda/bin/python3.6 train_model.py --batch_size 64 --eps 8.22789935548792e-09 --lr 0.00023934090400595828 --weight_decay 0.0013574056448429658
[2023-02-02 21:45:55.407 algo-1:27 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-02-02 21:45:55.652 algo-1:27 INFO profiler_config_parser.py:102] Using config at /opt/ml/input/config/profilerconfig.json.
Running on Device cuda:0
Hyperparameters : LR: 0.00023934090400595828,  Eps: 8.22789935548792e-09, Weight-decay: 0.0013574056448429658, Batch Size: 64, Epoch: 2
Data Dir Path: /opt/ml/input/data/train
Model Dir  Path: /opt/ml/model
Output Dir  Path: /opt/ml/output/data
[2023-02-02 21:46:00.572 algo-1:27 INFO json_config.py:91] Creating hook from json_config at /opt/ml/input/config/debughookconfig.json.
[2023-02-02 21:46:00.574 algo-1:27 INFO hook.py:199] tensorboard_dir has not been set for the hook. SMDebug will not be exporting tensorboard summaries.
[2023-02-02 21:46:00.575 algo-1:27 INFO hook.py:253] Saving to /opt/ml/output/tensors
[2023-02-02 21:46:00.576 algo-1:27 INFO state_store.py:77] The checkpoint config file /opt/ml/input/config/checkpointconfig.json does not exist.
[2023-02-02 21:46:00.606 algo-1:27 INFO hook.py:584] name:fc.0.weight count_params:524288
[2023-02-02 21:46:00.607 algo-1:27 INFO hook.py:584] name:fc.0.bias count_params:256
[2023-02-02 21:46:00.607 algo-1:27 INFO hook.py:584] name:fc.2.weight count_params:34048
[2023-02-02 21:46:00.608 algo-1:27 INFO hook.py:584] name:fc.2.bias count_params:133
[2023-02-02 21:46:00.608 algo-1:27 INFO hook.py:586] Total Trainable Params: 558725
Epoch 1 - Starting Training phase.
Epoch: 1 - Training Model on Complete Training Dataset!
[2023-02-02 21:46:01.747 algo-1:27 INFO hook.py:413] Monitoring the collections: gradients, CrossEntropyLoss_output_0, relu_input, losses
[2023-02-02 21:46:01.749 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/prestepzero-*-start-1675374355652793.2_train-0-stepstart-1675374361749233.0/python_stats.
[2023-02-02 21:46:01.768 algo-1:27 INFO hook.py:476] Hook is writing from the hook with pid: 27
[2023-02-02 21:46:12.376 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-0-stepstart-1675374361762277.0_train-0-forwardpassend-1675374372375976.8/python_stats.
[2023-02-02 21:46:13.425 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-0-forwardpassend-1675374372385274.0_train-1-stepstart-1675374373424226.2/python_stats.
[2023-02-02 21:46:17.588 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-1-stepstart-1675374373430919.2_train-1-forwardpassend-1675374377588009.8/python_stats.
[2023-02-02 21:46:18.463 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-1-forwardpassend-1675374377591750.0_train-2-stepstart-1675374378462369.8/python_stats.
[2023-02-02 21:46:22.450 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-2-stepstart-1675374378466878.5_train-2-forwardpassend-1675374382408504.0/python_stats.
[2023-02-02 21:46:23.425 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-2-forwardpassend-1675374382451907.0_train-3-stepstart-1675374383424827.8/python_stats.
[2023-02-02 21:46:27.395 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-3-stepstart-1675374383428854.8_train-3-forwardpassend-1675374387394790.0/python_stats.
[2023-02-02 21:46:28.687 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-3-forwardpassend-1675374387397322.8_train-4-stepstart-1675374388686763.0/python_stats.
[2023-02-02 21:46:32.160 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-4-stepstart-1675374388694483.5_train-4-forwardpassend-1675374392159810.5/python_stats.
[2023-02-02 21:46:33.176 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-4-forwardpassend-1675374392161803.0_train-5-stepstart-1675374393175468.0/python_stats.
[2023-02-02 21:46:36.537 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-5-stepstart-1675374393180429.5_train-5-forwardpassend-1675374396537505.8/python_stats.
[2023-02-02 21:46:37.280 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-5-forwardpassend-1675374396539309.8_train-6-stepstart-1675374397279402.8/python_stats.
[2023-02-02 21:46:40.664 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-6-stepstart-1675374397283719.5_train-6-forwardpassend-1675374400664573.8/python_stats.
[2023-02-02 21:46:41.566 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-6-forwardpassend-1675374400666649.0_train-7-stepstart-1675374401565724.2/python_stats.
[2023-02-02 21:46:44.883 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-7-stepstart-1675374401570113.2_train-7-forwardpassend-1675374404882778.0/python_stats.
[2023-02-02 21:46:45.788 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-7-forwardpassend-1675374404884567.0_train-8-stepstart-1675374405788208.2/python_stats.
[2023-02-02 21:46:49.176 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-8-stepstart-1675374405792509.0_train-8-forwardpassend-1675374409176302.5/python_stats.
[2023-02-02 21:46:50.198 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-8-forwardpassend-1675374409178382.2_train-9-stepstart-1675374410198045.0/python_stats.
[2023-02-02 21:46:53.590 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-9-stepstart-1675374410202298.8_train-9-forwardpassend-1675374413589852.5/python_stats.
[2023-02-02 21:46:54.623 algo-1:27 INFO python_profiler.py:182] Dumping cProfile stats to /opt/ml/output/profiler/framework/pytorch/cprofile/27-algo-1/train-9-forwardpassend-1675374413591964.8_train-10-stepstart-1675374414622837.2/python_stats.
Train set: Average loss: 4.5436, Accuracy: 1105/6680 (17%)
Epoch 1 - Starting Testing phase.
Epoch: 1 - Testing Model on Complete Testing Dataset!
Test set: Average loss: 4.1145, Accuracy: 216/836 (26%)
Epoch 2 - Starting Training phase.
Epoch: 2 - Training Model on Complete Training Dataset!
Train set: Average loss: 3.8656, Accuracy: 2074/6680 (31%)
Epoch 2 - Starting Testing phase.
Epoch: 2 - Testing Model on Complete Testing Dataset!
Test set: Average loss: 3.6677, Accuracy: 287/836 (34%)
Starting to Save the Model
Completed Saving the Model
INFO:__main__:Running on Device cuda:0
INFO:__main__:Hyperparameters : LR: 0.00023934090400595828,  Eps: 8.22789935548792e-09, Weight-decay: 0.0013574056448429658, Batch Size: 64, Epoch: 2
INFO:__main__:Data Dir Path: /opt/ml/input/data/train
INFO:__main__:Model Dir  Path: /opt/ml/model
INFO:__main__:Output Dir  Path: /opt/ml/output/data
Downloading: "https://download.pytorch.org/models/resnet50-19c8e357.pth" to /root/.cache/torch/hub/checkpoints/resnet50-19c8e357.pth
100%|██████████| 97.8M/97.8M [00:01<00:00, 62.0MB/s]
INFO:__main__:Epoch 1 - Starting Training phase.
INFO:__main__:Epoch: 1 - Training Model on Complete Training Dataset!
INFO:__main__:
Train set: Average loss: 4.5436, Accuracy: 1105/6680 (17%)
INFO:__main__:Epoch 1 - Starting Testing phase.
INFO:__main__:Epoch: 1 - Testing Model on Complete Testing Dataset!
INFO:__main__:
Test set: Average loss: 4.1145, Accuracy: 216/836 (26%)
INFO:__main__:Epoch 2 - Starting Training phase.
INFO:__main__:Epoch: 2 - Training Model on Complete Training Dataset!
INFO:__main__:
Train set: Average loss: 3.8656, Accuracy: 2074/6680 (31%)
INFO:__main__:Epoch 2 - Starting Testing phase.
INFO:__main__:Epoch: 2 - Testing Model on Complete Testing Dataset!
2023-02-02 21:50:25,716 sagemaker-training-toolkit INFO     Reporting training SUCCESS
INFO:__main__:
Test set: Average loss: 3.6677, Accuracy: 287/836 (34%)
INFO:__main__:Starting to Save the Model
INFO:__main__:Completed Saving the Model
VanishingGradient: Error
Overfit: InProgress
Overtraining: InProgress
PoorWeightInitialization: InProgress

2023-02-02 21:51:36 Uploading - Uploading generated training modelVanishingGradient: Error
Overfit: InProgress
Overtraining: IssuesFound
PoorWeightInitialization: Error

2023-02-02 21:52:04 Completed - Training job completed
VanishingGradient: Error
Overfit: NoIssuesFound
Overtraining: IssuesFound
PoorWeightInitialization: Error
Training seconds: 493
Billable seconds: 493
In [36]:
# Fetch the job name, client, and job description to be used for plotting.
job_name = estimator.latest_training_job.name
client = estimator.sagemaker_session.sagemaker_client
description = client.describe_training_job(TrainingJobName=estimator.latest_training_job.name)
print(f"Jobname: {job_name}")
print(f"Client: {client}")
print(f"Description: {description}")
Jobname: pytorch-training-2023-02-02-21-42-08-475
Client: <botocore.client.SageMaker object at 0x7f7cff3789d0>
Description: {'TrainingJobName': 'pytorch-training-2023-02-02-21-42-08-475', 'TrainingJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:training-job/pytorch-training-2023-02-02-21-42-08-475', 'ModelArtifacts': {'S3ModelArtifacts': 's3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/output/model.tar.gz'}, 'TrainingJobStatus': 'Completed', 'SecondaryStatus': 'Completed', 'HyperParameters': {'batch_size': '64', 'eps': '"8.22789935548792e-09"', 'lr': '"0.00023934090400595828"', 'sagemaker_container_log_level': '20', 'sagemaker_job_name': '"pytorch-training-2023-02-02-21-42-08-475"', 'sagemaker_program': '"train_model.py"', 'sagemaker_region': '"us-west-2"', 'sagemaker_submit_directory': '"s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/source/sourcedir.tar.gz"', 'weight_decay': '"0.0013574056448429658"'}, 'AlgorithmSpecification': {'TrainingImage': '763104351884.dkr.ecr.us-west-2.amazonaws.com/pytorch-training:1.6-gpu-py36', 'TrainingInputMode': 'File', 'EnableSageMakerMetricsTimeSeries': True}, 'RoleArn': 'arn:aws:iam::232496288858:role/service-role/AmazonSageMaker-ExecutionRole-20230202T190502', 'InputDataConfig': [{'ChannelName': 'train', 'DataSource': {'S3DataSource': {'S3DataType': 'S3Prefix', 'S3Uri': 's3://sagemaker-us-west-2-232496288858/dogImagesDataset', 'S3DataDistributionType': 'FullyReplicated'}}, 'CompressionType': 'None', 'RecordWrapperType': 'None'}], 'OutputDataConfig': {'KmsKeyId': '', 'S3OutputPath': 's3://sagemaker-us-west-2-232496288858/'}, 'ResourceConfig': {'InstanceType': 'ml.p3.2xlarge', 'InstanceCount': 1, 'VolumeSizeInGB': 30}, 'StoppingCondition': {'MaxRuntimeInSeconds': 86400}, 'CreationTime': datetime.datetime(2023, 2, 2, 21, 42, 9, 93000, tzinfo=tzlocal()), 'TrainingStartTime': datetime.datetime(2023, 2, 2, 21, 43, 37, tzinfo=tzlocal()), 'TrainingEndTime': datetime.datetime(2023, 2, 2, 21, 51, 50, 516000, tzinfo=tzlocal()), 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 
52, 16, 683000, tzinfo=tzlocal()), 'SecondaryStatusTransitions': [{'Status': 'Starting', 'StartTime': datetime.datetime(2023, 2, 2, 21, 42, 9, 93000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 43, 37, tzinfo=tzlocal()), 'StatusMessage': 'Preparing the instances for training'}, {'Status': 'Downloading', 'StartTime': datetime.datetime(2023, 2, 2, 21, 43, 37, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 44, 32, 551000, tzinfo=tzlocal()), 'StatusMessage': 'Downloading input data'}, {'Status': 'Training', 'StartTime': datetime.datetime(2023, 2, 2, 21, 44, 32, 551000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 51, 30, 100000, tzinfo=tzlocal()), 'StatusMessage': 'Training image download completed. Training in progress.'}, {'Status': 'Uploading', 'StartTime': datetime.datetime(2023, 2, 2, 21, 51, 30, 100000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 51, 50, 516000, tzinfo=tzlocal()), 'StatusMessage': 'Uploading generated training model'}, {'Status': 'Completed', 'StartTime': datetime.datetime(2023, 2, 2, 21, 51, 50, 516000, tzinfo=tzlocal()), 'EndTime': datetime.datetime(2023, 2, 2, 21, 51, 50, 516000, tzinfo=tzlocal()), 'StatusMessage': 'Training job completed'}], 'EnableNetworkIsolation': False, 'EnableInterContainerTrafficEncryption': False, 'EnableManagedSpotTraining': False, 'TrainingTimeInSeconds': 493, 'BillableTimeInSeconds': 493, 'DebugHookConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-232496288858/', 'CollectionConfigurations': [{'CollectionName': 'relu_input', 'CollectionParameters': {'include_regex': '.*relu_input', 'save_interval': '500'}}, {'CollectionName': 'CrossEntropyLoss_output_0', 'CollectionParameters': {'eval.save_interval': '1', 'include_regex': 'CrossEntropyLoss_output_0', 'train.save_interval': '10'}}, {'CollectionName': 'gradients', 'CollectionParameters': {'save_interval': '500'}}]}, 'DebugRuleConfigurations': [{'RuleConfigurationName': 'VanishingGradient', 
'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'VanishingGradient'}}, {'RuleConfigurationName': 'Overfit', 'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'Overfit'}}, {'RuleConfigurationName': 'Overtraining', 'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'Overtraining'}}, {'RuleConfigurationName': 'PoorWeightInitialization', 'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'PoorWeightInitialization'}}], 'DebugRuleEvaluationStatuses': [{'RuleConfigurationName': 'VanishingGradient', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-vanishinggradient-d9566825', 'RuleEvaluationStatus': 'Error', 'StatusDetails': 'InternalServerError: We encountered an internal error. 
Please try again.', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 4, 277000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'Overfit', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-overfit-c5e75be0', 'RuleEvaluationStatus': 'NoIssuesFound', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 4, 277000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'Overtraining', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-overtraining-d9fd7780', 'RuleEvaluationStatus': 'IssuesFound', 'StatusDetails': 'RuleEvaluationConditionMet: Evaluation of the rule Overtraining at step 116 resulted in the condition being met\n', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 4, 277000, tzinfo=tzlocal())}, {'RuleConfigurationName': 'PoorWeightInitialization', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-poorweightinitialization-5bed3620', 'RuleEvaluationStatus': 'Error', 'StatusDetails': 'InternalServerError: We encountered an internal error. 
Please try again.', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 4, 277000, tzinfo=tzlocal())}], 'ProfilerConfig': {'S3OutputPath': 's3://sagemaker-us-west-2-232496288858/', 'ProfilingIntervalInMilliseconds': 500, 'ProfilingParameters': {'DataloaderProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "MetricsRegex": ".*", }', 'DetailedProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'FileOpenFailThreshold': '50', 'HorovodProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }', 'LocalPath': '/opt/ml/output/profiler', 'PythonProfilingConfig': '{"StartStep": 0, "NumSteps": 10, "ProfilerName": "cprofile", "cProfileTimer": "total_time", }', 'RotateFileCloseIntervalInSeconds': '60', 'RotateMaxFileSizeInBytes': '10485760', 'SMDataParallelProfilingConfig': '{"StartStep": 0, "NumSteps": 10, }'}, 'DisableProfiler': False}, 'ProfilerRuleConfigurations': [{'RuleConfigurationName': 'ProfilerReport', 'RuleEvaluatorImage': '895741380848.dkr.ecr.us-west-2.amazonaws.com/sagemaker-debugger-rules:latest', 'VolumeSizeInGB': 0, 'RuleParameters': {'rule_to_invoke': 'ProfilerReport'}}], 'ProfilerRuleEvaluationStatuses': [{'RuleConfigurationName': 'ProfilerReport', 'RuleEvaluationJobArn': 'arn:aws:sagemaker:us-west-2:232496288858:processing-job/pytorch-training-2023-02-0-profilerreport-f095e87b', 'RuleEvaluationStatus': 'NoIssuesFound', 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 52, 16, 672000, tzinfo=tzlocal())}], 'ProfilingStatus': 'Enabled', 'ResponseMetadata': {'RequestId': '53eeb283-6c04-4416-8b65-c8c3f17946fa', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amzn-requestid': '53eeb283-6c04-4416-8b65-c8c3f17946fa', 'content-type': 'application/x-amz-json-1.1', 'content-length': '7013', 'date': 'Thu, 02 Feb 2023 21:52:27 GMT'}, 'RetryAttempts': 0}}

Yes, there is anomalous behaviour in the debugging output. The Overtraining rule evaluation reported IssuesFound: its condition was met at step 116, which indicates the model continued training after evaluation performance stopped improving. Reducing the number of epochs or adding early stopping based on validation loss would address this. Separately, the VanishingGradient and PoorWeightInitialization rule evaluation jobs failed with an InternalServerError on the SageMaker side; that is a service error rather than a model problem, and re-running the training job (or just the rule evaluations) is the straightforward fix. The Overfit rule found no issues.
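The `DebugRuleEvaluationStatuses` list in the describe output above can also be scanned programmatically. Below is a minimal, self-contained sketch; the sample statuses are copied from that output, and `problem_rules` is a hypothetical helper, not a SageMaker API:

```python
# Sample rule evaluation statuses, copied from the DescribeTrainingJob output above.
statuses = [
    {"RuleConfigurationName": "VanishingGradient", "RuleEvaluationStatus": "Error"},
    {"RuleConfigurationName": "Overfit", "RuleEvaluationStatus": "NoIssuesFound"},
    {"RuleConfigurationName": "Overtraining", "RuleEvaluationStatus": "IssuesFound"},
    {"RuleConfigurationName": "PoorWeightInitialization", "RuleEvaluationStatus": "Error"},
]

def problem_rules(statuses):
    """Return the names of rules whose evaluation found issues or errored out."""
    return [s["RuleConfigurationName"]
            for s in statuses
            if s["RuleEvaluationStatus"] in ("IssuesFound", "Error")]

print(problem_rules(statuses))
```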

In [37]:
from smdebug.trials import create_trial
from smdebug.core.modes import ModeKeys
#creating a trial
trial = create_trial(estimator.latest_job_debugger_artifacts_path())
[2023-02-02 21:53:30.297 datascience-1-0-ml-t3-medium-fbbacbd136ea35c00e5ce9203df8:18 INFO utils.py:27] RULE_JOB_STOP_SIGNAL_FILENAME: None
[2023-02-02 21:53:30.330 datascience-1-0-ml-t3-medium-fbbacbd136ea35c00e5ce9203df8:18 INFO s3_trial.py:42] Loading trial debug-output at path s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/debug-output
In [38]:
trial.tensor_names() #all the tensor names
[2023-02-02 21:53:49.053 datascience-1-0-ml-t3-medium-fbbacbd136ea35c00e5ce9203df8:18 INFO trial.py:198] Training has ended, will refresh one final time in 1 sec.
[2023-02-02 21:53:50.073 datascience-1-0-ml-t3-medium-fbbacbd136ea35c00e5ce9203df8:18 INFO trial.py:210] Loaded all steps
Out[38]:
['CrossEntropyLoss_output_0',
 'gradient/ResNet_fc.0.bias',
 'gradient/ResNet_fc.0.weight',
 'gradient/ResNet_fc.2.bias',
 'gradient/ResNet_fc.2.weight',
 'layer1.0.relu_input_0',
 'layer1.0.relu_input_1',
 'layer1.0.relu_input_2',
 'layer1.1.relu_input_0',
 'layer1.1.relu_input_1',
 'layer1.1.relu_input_2',
 'layer1.2.relu_input_0',
 'layer1.2.relu_input_1',
 'layer1.2.relu_input_2',
 'layer2.0.relu_input_0',
 'layer2.0.relu_input_1',
 'layer2.0.relu_input_2',
 'layer2.1.relu_input_0',
 'layer2.1.relu_input_1',
 'layer2.1.relu_input_2',
 'layer2.2.relu_input_0',
 'layer2.2.relu_input_1',
 'layer2.2.relu_input_2',
 'layer2.3.relu_input_0',
 'layer2.3.relu_input_1',
 'layer2.3.relu_input_2',
 'layer3.0.relu_input_0',
 'layer3.0.relu_input_1',
 'layer3.0.relu_input_2',
 'layer3.1.relu_input_0',
 'layer3.1.relu_input_1',
 'layer3.1.relu_input_2',
 'layer3.2.relu_input_0',
 'layer3.2.relu_input_1',
 'layer3.2.relu_input_2',
 'layer3.3.relu_input_0',
 'layer3.3.relu_input_1',
 'layer3.3.relu_input_2',
 'layer3.4.relu_input_0',
 'layer3.4.relu_input_1',
 'layer3.4.relu_input_2',
 'layer3.5.relu_input_0',
 'layer3.5.relu_input_1',
 'layer3.5.relu_input_2',
 'layer4.0.relu_input_0',
 'layer4.0.relu_input_1',
 'layer4.0.relu_input_2',
 'layer4.1.relu_input_0',
 'layer4.1.relu_input_1',
 'layer4.1.relu_input_2',
 'layer4.2.relu_input_0',
 'layer4.2.relu_input_1',
 'layer4.2.relu_input_2',
 'relu_input_0']
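The tensor names above correspond to the three collections configured in the debug hook (`CrossEntropyLoss_output_0`, `gradients`, `relu_input`). A small, hypothetical helper can group them by collection; the sample names below are a subset of the list above:

```python
# Subset of the names returned by trial.tensor_names() above.
names = [
    "CrossEntropyLoss_output_0",
    "gradient/ResNet_fc.0.bias",
    "gradient/ResNet_fc.0.weight",
    "layer1.0.relu_input_0",
    "relu_input_0",
]

def group_tensors(names):
    """Bucket tensor names by the debugger collection they belong to."""
    groups = {"loss": [], "gradients": [], "relu_input": []}
    for n in names:
        if n.startswith("gradient/"):
            groups["gradients"].append(n)
        elif "relu_input" in n:
            groups["relu_input"].append(n)
        else:
            groups["loss"].append(n)
    return groups

print({k: len(v) for k, v in group_tensors(names).items()})
```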
In [39]:
len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.TRAIN))
Out[39]:
21
In [40]:
len(trial.tensor("CrossEntropyLoss_output_0").steps(mode=ModeKeys.EVAL))
Out[40]:
28
In [41]:
#Defining some utility functions to be used for plotting tensors
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import host_subplot

#utility function to get data from tensors
def get_data(trial, tname, mode):
    tensor = trial.tensor(tname)
    steps = tensor.steps(mode=mode)
    vals = []
    for s in steps:
        vals.append(tensor.value(s, mode=mode))
    return steps, vals

#utility function for plotting a tensor's TRAIN and EVAL values on shared axes
def plot_tensor(trial, tensor_name):

    steps_train, vals_train = get_data(trial, tensor_name, mode=ModeKeys.TRAIN)
    print("loaded TRAIN data")
    steps_eval, vals_eval = get_data(trial, tensor_name, mode=ModeKeys.EVAL)
    print("loaded EVAL data")

    fig = plt.figure(figsize=(10, 7))
    host = host_subplot(111)

    par = host.twiny()

    host.set_xlabel("Steps (TRAIN)")
    par.set_xlabel("Steps (EVAL)")
    host.set_ylabel(tensor_name)

    (p1,) = host.plot(steps_train, vals_train, label=tensor_name)
    print("Completed TRAIN plot")
    (p2,) = par.plot(steps_eval, vals_eval, label="val_" + tensor_name)
    print("Completed EVAL plot")
    leg = plt.legend()

    host.xaxis.get_label().set_color(p1.get_color())
    leg.texts[0].set_color(p1.get_color())

    par.xaxis.get_label().set_color(p2.get_color())
    leg.texts[1].set_color(p2.get_color())

    plt.ylabel(tensor_name)
    plt.show()
In [42]:
#plotting the tensor
plot_tensor(trial, "CrossEntropyLoss_output_0");
loaded TRAIN data
loaded EVAL data
Completed TRAIN plot
Completed EVAL plot
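The raw per-step loss values can be noisy; if the curve is hard to read, a simple moving average can be applied to the values returned by `get_data` before plotting. The `smooth` helper below is a hypothetical addition, not part of smdebug:

```python
import numpy as np

def smooth(vals, window=5):
    """Trailing moving average; the window shrinks at the start of the series."""
    vals = np.asarray(vals, dtype=float)
    out = []
    for i in range(len(vals)):
        lo = max(0, i - window + 1)
        out.append(vals[lo:i + 1].mean())
    return out

# Example on synthetic loss values:
print(smooth([4.0, 3.0, 2.0, 1.0], window=2))  # [4.0, 3.5, 2.5, 1.5]
```

To use it, smoothing would be applied inside `plot_tensor` right after `get_data`, e.g. `vals_train = smooth(vals_train)`.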
In [43]:
# TODO: Display the profiler output
rule_output_path = estimator.output_path + estimator.latest_training_job.job_name + "/rule-output"
print(f"Profiler report location: {rule_output_path}")
Profiler report location: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output
In [44]:
! aws s3 ls {rule_output_path} --recursive
2023-02-02 21:52:02     380974 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-report.html
2023-02-02 21:52:02     230126 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb
2023-02-02 21:51:58        191 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json
2023-02-02 21:51:58      13612 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
2023-02-02 21:51:58        126 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json
2023-02-02 21:51:58        129 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json
2023-02-02 21:51:58       1008 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json
2023-02-02 21:51:58        309 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json
2023-02-02 21:51:58        153 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json
2023-02-02 21:51:58        232 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/MaxInitializationTime.json
2023-02-02 21:51:58       1057 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json
2023-02-02 21:51:58        610 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallSystemUsage.json
2023-02-02 21:51:58       2462 pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/StepOutlier.json
In [45]:
! aws s3 cp {rule_output_path} ./ --recursive
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/BatchSize.json to ProfilerReport/profiler-output/profiler-reports/BatchSize.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json to ProfilerReport/profiler-output/profiler-reports/CPUBottleneck.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-report.ipynb to ProfilerReport/profiler-output/profiler-report.ipynb
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/Dataloader.json to ProfilerReport/profiler-output/profiler-reports/Dataloader.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json to ProfilerReport/profiler-output/profiler-reports/LoadBalancing.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json to ProfilerReport/profiler-output/profiler-reports/OverallFrameworkMetrics.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json to ProfilerReport/profiler-output/profiler-reports/GPUMemoryIncrease.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/OverallSystemUsage.json to ProfilerReport/profiler-output/profiler-reports/OverallSystemUsage.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json to ProfilerReport/profiler-output/profiler-reports/LowGPUUtilization.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/MaxInitializationTime.json to ProfilerReport/profiler-output/profiler-reports/MaxInitializationTime.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/StepOutlier.json to ProfilerReport/profiler-output/profiler-reports/StepOutlier.json
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-report.html to ProfilerReport/profiler-output/profiler-report.html
download: s3://sagemaker-us-west-2-232496288858/pytorch-training-2023-02-02-21-42-08-475/rule-output/ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json to ProfilerReport/profiler-output/profiler-reports/IOBottleneck.json
In [48]:
import os
import IPython

# get the autogenerated folder name of profiler report
profiler_report_name = [
    rule["RuleConfigurationName"]
    for rule in estimator.latest_training_job.rule_job_summary()
    if "Profiler" in rule["RuleConfigurationName"]
][0]
In [47]:
# Zipping the ProfilerReport in order to export and upload it later for submission
import shutil
shutil.make_archive("./profiler_report", "zip", "ProfilerReport")
Out[47]:
'/root/CD0387-deep-learning-topics-within-computer-vision-nlp-project-starter/profiler_report.zip'
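`shutil.make_archive` returns the absolute path of the archive it creates, which is why the zip path appears in the cell output above. A self-contained sketch of the same pattern on a throwaway temporary directory:

```python
import os
import shutil
import tempfile

# Create a temporary directory with one file, zip it, and confirm the archive exists.
src = tempfile.mkdtemp()
with open(os.path.join(src, "report.txt"), "w") as f:
    f.write("profiler report placeholder")

archive_path = shutil.make_archive(
    os.path.join(tempfile.gettempdir(), "profiler_report_demo"), "zip", src
)
print(os.path.exists(archive_path))  # True
```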

Model Deployment¶

In [49]:
# TODO: Deploy your model to an endpoint

predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.p3.2xlarge")
--------!
In [54]:
from sagemaker.pytorch import PyTorchModel
from sagemaker.predictor import Predictor

#S3 location of the model artifacts saved by the training job that used the best hyperparameters
model_data_artifacts = "s3://sagemaker-us-west-2-232496288858/pytorch-training-230202-2107-002-dcffdac6/output/model.tar.gz"

#Define the default serializer and deserializer for prediction requests and responses
jpeg_serializer = sagemaker.serializers.IdentitySerializer("image/jpeg")
json_deserializer = sagemaker.deserializers.JSONDeserializer()

#To override the default serializer and deserializer, we define a class that inherits from Predictor and pass it to PyTorchModel via the predictor_cls parameter
class ImgPredictor(Predictor):
    def __init__(self, endpoint_name, sagemaker_session):
        super(ImgPredictor, self).__init__(
            endpoint_name,
            sagemaker_session=sagemaker_session,
            serializer=jpeg_serializer,
            deserializer=json_deserializer,
        )

pytorch_model = PyTorchModel(
    model_data=model_data_artifacts,
    role=role,
    entry_point="endpoint_inference.py",
    py_version="py36",
    framework_version="1.6",
    predictor_cls=ImgPredictor,
)

predictor = pytorch_model.deploy(initial_instance_count=1, instance_type="ml.t2.medium")  # Using ml.t2.medium to save costs
-------------!
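As a sanity check on the content types involved: the identity serializer forwards the raw JPEG bytes unchanged with the declared content type, and the JSON deserializer parses the JSON body the endpoint returns. The stand-in classes below only mimic that behaviour for illustration; they are not the sagemaker classes:

```python
import io
import json

class IdentitySerializerStub:
    """Stand-in mimicking sagemaker.serializers.IdentitySerializer: bytes pass through."""
    def __init__(self, content_type):
        self.CONTENT_TYPE = content_type
    def serialize(self, data):
        return data

class JSONDeserializerStub:
    """Stand-in mimicking sagemaker.deserializers.JSONDeserializer."""
    def deserialize(self, stream, content_type="application/json"):
        return json.load(stream)

payload = b"\xff\xd8\xff"  # fake JPEG magic bytes
assert IdentitySerializerStub("image/jpeg").serialize(payload) == payload  # untouched

body = io.BytesIO(json.dumps([[0.1, 0.9]]).encode())
print(JSONDeserializerStub().deserialize(body))  # [[0.1, 0.9]]
```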
In [55]:
#Testing the deployed endpoint using some test images
#Solution 1: Using the Predictor object directly.
from PIL import Image
import io
import os
import numpy as np

test_dir = "./dogImages/test/129.Tibetan_mastiff/"
test_images = ["Tibetan_mastiff_08158.jpg", "Tibetan_mastiff_08139.jpg", "Tibetan_mastiff_08138.jpg"]
test_images_expected_output = [129, 5, 21 ]
for index in range(len(test_images) ):
    test_img = test_images[index]
    expected_breed_category = test_images_expected_output[index]
    print(f"Test image no: {index+1}")
    test_file_path = os.path.join(test_dir,test_img)
    with open(test_file_path , "rb") as f:
        payload = f.read()
        print("Below is the image that we will be testing:")
        display(Image.open(io.BytesIO(payload)))
        print(f"Expected dog breed category no : {expected_breed_category}")
        response = predictor.predict(payload, initial_args={"ContentType": "image/jpeg"})
        print(f"Response: {response}")
        predicted_dog_breed = np.argmax(response, 1) + 1  # add 1: argmax is zero-indexed, breed categories start at 1
        print(f"Response/Inference for the above image is : {predicted_dog_breed}")
        print("----------------------------------------------------------------------")
Test image no: 1
Below is the image that we will be testing:
Expected dog breed category no : 129
Response: [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.662408173084259, 0.05551016703248024, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.21218474209308624, 0.0, 0.0, 0.0, 0.0, 0.9146386384963989, 0.5780836939811707, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.1485905647277832, 0.0, 0.01837325096130371, 0.0, 0.0, 0.0, 0.3390808701515198, 0.0, 0.0, 0.0, 0.8508307933807373, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.6743180751800537, 0.0, 0.0, 1.4576091766357422, 0.9207280278205872, 0.0, 0.0, 0.0, 0.9517050981521606, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7739618420600891, 0.5983347296714783, 0.0, 0.0, 0.2721315026283264, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.917948842048645, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0662590265274048, 0.0, 0.0, 0.0, 0.0, 0.5779383778572083, 0.5139908790588379, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.4284725189208984, 0.0, 0.0, 0.0, 0.0]]
Response/Inference for the above image is : [129]
----------------------------------------------------------------------
Test image no: 2
Below is the image that we will be testing:
Expected dog breed category no : 5
Response: [[0.0, 0.08648087084293365, 0.07022416591644287, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5109014511108398, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.011233434081077576, 0.0, 0.0, 0.0, 0.0, 1.5123292207717896, 0.7675759792327881, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2232207953929901, 0.0, 0.40191733837127686, 0.0, 0.0, 0.0, 0.8265922665596008, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.08851224929094315, 0.0, 1.5181233882904053, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3.390899181365967, 0.0, 0.0, 1.4006421566009521, 1.336551547050476, 0.0, 0.0, 0.0, 0.6995121836662292, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.9968060255050659, 1.3887072801589966, 0.0, 0.0, 0.3057568073272705, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5242455005645752, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0574803352355957, 0.0, 0.0, 0.0, 0.0, 0.48372882604599, 0.5821180939674377, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.226931571960449, 0.0, 0.0, 0.0, 0.0]]
Response/Inference for the above image is : [76]
----------------------------------------------------------------------
Test image no: 3
Below is the image that we will be testing:
Expected dog breed category no : 21
Response: [[0.0, 0.47988080978393555, 0.0534096360206604, 0.0, 2.029130697250366, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.8579487204551697, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.42556288838386536, 0.5418018102645874, 0.0, 0.0, 0.0, 0.19376616179943085, 0.4305182993412018, 0.5111054182052612, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4361092150211334, 0.0, 0.0, 0.0, 0.0, 0.0, 0.06514547765254974, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.21120649576187134, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.32620373368263245, 0.904705286026001, 0.0, 0.0, 0.0, 0.09315355867147446, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.6515793204307556, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.7480322122573853, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.647281527519226, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4045291244983673, 0.0, 0.0, 0.0, 0.0, 3.648926258087158, 0.0, 0.0, 0.0, 0.0]]
Response/Inference for the above image is : [129]
----------------------------------------------------------------------
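The mapping from the 133-element response vector to a breed category can be sketched in isolation. The scores below are synthetic; the single large value at index 128 mirrors the shape of the first response above and maps to category 129:

```python
import numpy as np

# Synthetic response: 133 activation scores with the largest at index 128.
response = [[0.0] * 133]
response[0][128] = 2.43

# argmax is zero-indexed; breed categories start at 1, hence the +1.
predicted = int(np.argmax(response, axis=1)[0]) + 1
print(predicted)  # 129
```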
In [56]:
print(predictor.endpoint_name)
endpoint_name = predictor.endpoint_name
pytorch-inference-2023-02-02-22-15-13-553
In [57]:
# Solution 2: Using boto3
# Using the boto3 runtime client to test the deployed model's endpoint
import os
import io
import boto3
import json
import base64
import PIL
# Storing the endpoint name for the boto3 invocations

ENDPOINT_NAME = endpoint_name
# Use boto3's lightweight SageMaker runtime client to invoke the endpoint.
runtime = boto3.client('runtime.sagemaker')
test_dir = "./dogImages/test/129.Tibetan_mastiff/"
test_images = ["Tibetan_mastiff_08158.jpg", "Tibetan_mastiff_08139.jpg", "Tibetan_mastiff_08138.jpg"]
test_images_expected_output = [129, 5, 21 ]
for index in range(len(test_images) ):
    test_img = test_images[index]
    expected_breed_category = test_images_expected_output[index]
    print(f"Test image no: {index+1}")
    test_file_path = os.path.join(test_dir,test_img)
    with open(test_file_path , "rb") as f:
        payload = f.read()
        print("Below is the image that we will be testing:")
        display(Image.open(io.BytesIO(payload)))
        print(f"Expected dog breed category no : {expected_breed_category}")
        response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                       ContentType='image/jpeg',
                                       Body=payload)
        response_body = np.asarray(json.loads( response['Body'].read().decode('utf-8')))        
        print(f"Response: {response_body}")        
        predicted_dog_breed = np.argmax(response_body, 1) + 1  # add 1: argmax is zero-indexed, breed categories start at 1
        print(f"Response/Inference for the above image is : {predicted_dog_breed}")
Test image no: 1
Below is the image that we will be testing:
Expected dog breed category no : 129
Response: [[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.66240817
  0.05551017 0.         0.         0.         0.         0.
  0.         0.         0.         0.21218474 0.         0.
  0.         0.         0.91463864 0.57808369 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         1.14859056 0.         0.01837325 0.         0.
  0.         0.33908087 0.         0.         0.         0.85083079
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         1.67431808 0.         0.
  1.45760918 0.92072803 0.         0.         0.         0.9517051
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.77396184 0.59833473
  0.         0.         0.2721315  0.         0.         0.
  0.         0.         0.         1.91794884 0.         0.
  0.         0.         0.         1.06625903 0.         0.
  0.         0.         0.57793838 0.51399088 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         2.42847252 0.         0.         0.
  0.        ]]
Response/Inference for the above image is : [129]
Test image no: 2
Below is the image that we will be testing:
Expected dog breed category no : 5
Response: [[0.         0.08648087 0.07022417 0.         0.         0.
  0.         0.         0.         0.         0.         0.51090145
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.01123343 0.         0.
  0.         0.         1.51232922 0.76757598 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.2232208  0.         0.40191734 0.         0.
  0.         0.82659227 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.08851225
  0.         1.51812339 0.         0.         0.         0.
  0.         0.         0.         3.39089918 0.         0.
  1.40064216 1.33655155 0.         0.         0.         0.69951218
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.99680603 1.38870728
  0.         0.         0.30575681 0.         0.         0.
  0.         0.         0.         1.5242455  0.         0.
  0.         0.         0.         2.05748034 0.         0.
  0.         0.         0.48372883 0.58211809 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         2.22693157 0.         0.         0.
  0.        ]]
Response/Inference for the above image is : [76]
Test image no: 3
Below is the image that we will be testing:
Expected dog breed category no : 21
Response: [[0.         0.47988081 0.05340964 0.         2.0291307  0.
  0.         0.         0.         0.         0.         0.85794872
  0.         0.         0.         0.         0.         0.
  0.         0.         0.42556289 0.54180181 0.         0.
  0.         0.19376616 0.4305183  0.51110542 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.43610922 0.         0.         0.         0.
  0.         0.06514548 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.2112065
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.32620373 0.90470529 0.         0.         0.         0.09315356
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.65157932 0.         0.         0.
  0.         0.         0.         1.74803221 0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         1.64728153 0.         0.         0.
  0.         0.         0.         0.40452912 0.         0.
  0.         0.         3.64892626 0.         0.         0.
  0.        ]]
Response/Inference for the above image is : [129]
In [58]:
# TODO: Remember to shutdown/delete your endpoint once your work is done
predictor.delete_endpoint()
In [ ]: